Load the necessary libraries
library(mgcv) #for GAMs
library(gratia) #for GAM plots
library(emmeans) #for marginal means etc
library(broom) #for tidy output
library(MuMIn) #for model selection and AICc
library(lubridate) #for processing dates
library(tidyverse) #for data wrangling
library(DHARMa) #for residuals diagnostics
library(performance) #for residual disagnostics
library(see) # to visualize residual diagnostics
library(patchwork) #for grids of plots
theme_set(theme_classic())
The Australian Institute of Marine Science (AIMS) have a long-term inshore marine water quality monitoring program in which water samples are collected and analysed from sites (reef_alias) across the GBR numerous times per year. The focus of this program is to report long-term condition and change in water quality parameters.
Although we do have latitude and longitudes, the nature of the spatial design predominantly reflects a series of transects that start near the mouth of a major river and extend northwards, yet mainly within the open coastal zone. As a result, this design is not well suited to any specific spatial analyses (since they are mainly one dimensional).
AIMS water quality monitoring
Format of aims.wq.csv data file
| latitude | longitude | reef_alias | water_samples | region | subregion | season | water_year | nox |
|---|---|---|---|---|---|---|---|---|
| -16.1 | 145. | Cape Trib… | AIMS | Wet T… | Barron D… | Dry | 2008 | 0.830 |
| -16.1 | 145. | Cape Trib… | AIMS | Wet T… | Barron D… | Wet | 2008 | 0.100 |
| -16.1 | 145. | Cape Trib… | AIMS | Wet T… | Barron D… | Dry | 2009 | 0.282 |
| -16.1 | 145. | Cape Trib… | AIMS | Wet T… | Barron D… | Wet | 2009 | 1.27 |
| -16.1 | 145. | Cape Trib… | AIMS | Wet T… | Barron D… | Dry | 2009 | 0.793 |
| -16.1 | 145. | Cape Trib… | AIMS | Wet T… | Barron D… | Dry | 2010 | 0.380 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| latitude | - Latitudinal coordinate |
| longitude | - Longitudinal coordinate |
| reef_alias | - Internal AIMS reef name |
| water_samples | - Categorical label of who collected the data |
| region | - The MMP region |
| subregion | - The MMP subregion |
| season | - A categorical listing of Wet or Dry |
| water_year | - The water year (1st Oct - 30 Sept) to which the data are attached |
| date | - The date the sample was collected |
| nox | - Nitrite + Nitrate added together |
wq <- read_csv('../data/aims.wq.csv', trim_ws=TRUE) %>%
janitor::clean_names() %>%
select(latitude, longitude, reef_alias, water_samples, region, subregion,
season, water_year, month = mnth, date, nox = n_ox)
wq <- wq %>%
mutate(reef_alias = factor(reef_alias),
region = factor(region),
subregion = factor(subregion),
season = factor(season))
glimpse(wq)
## Rows: 618
## Columns: 11
## $ latitude <dbl> -16.11152, -16.11383, -16.11261, -16.11308, -16.11343, …
## $ longitude <dbl> 145.4833, 145.4827, 145.4844, 145.4868, 145.4835, 145.4…
## $ reef_alias <fct> Cape Tribulation, Cape Tribulation, Cape Tribulation, C…
## $ water_samples <chr> "AIMS", "AIMS", "AIMS", "AIMS", "AIMS", "AIMS", "AIMS",…
## $ region <fct> Wet Tropics, Wet Tropics, Wet Tropics, Wet Tropics, Wet…
## $ subregion <fct> Barron Daintree, Barron Daintree, Barron Daintree, Barr…
## $ season <fct> Dry, Wet, Dry, Wet, Dry, Dry, Wet, Dry, Dry, Wet, Dry, …
## $ water_year <dbl> 2008, 2008, 2009, 2009, 2009, 2010, 2010, 2010, 2011, 2…
## $ month <dbl> 10, 3, 10, 2, 6, 10, 3, 6, 10, 2, 6, 10, 2, 6, 10, 2, 6…
## $ date <date> 2007-10-14, 2008-03-30, 2008-10-12, 2009-02-26, 2009-0…
## $ nox <dbl> 0.8300237, 0.0100000, 0.2819196, 1.2675291, 0.7934017, …
wq %>%
ggplot(aes(x = date, y = nox, col = subregion)) +
geom_point() +
facet_grid(season ~ region)
wq %>%
ggplot(aes(x = date, y = nox, col = subregion)) +
geom_point() +
facet_wrap( ~ reef_alias, scales = "free_y") +
geom_hline(yintercept=0) +
geom_smooth(col = "red") +
scale_y_continuous(trans = scales::pseudo_log_trans())
Model formula: \[ y_i \sim{} \mathcal{N}(\mu_i, \sigma^2)\\ \mu_i =\beta_0 + f(Date_i) + f(Month_i) \]
where \(\beta_0\) is the y-intercept. \(f(Date)\) and \(f(Month)\) indicate the additive smoothing functions of the long-term temporal trends and the annual seasonal trends respectively.
Use package lubridate to get dates into decimals as a numerical variable:
wq <- wq %>% mutate(dt = decimal_date(date))
We’ll start with a simple model only.
wq1 <- wq %>% filter(reef_alias == "Green") %>%
filter(complete.cases(nox))
wq1 %>% ggplot(aes(x = date, y = nox)) +
geom_smooth() +
scale_y_log10() +
scale_x_date(expand = expansion()) +
geom_point()
Maybe not so much this example, but other examples show that Gaussian may not work super well. However, we can try other distributions, such as a gamma (but replace all 0 detections with an arbitrarily small number), lognormal, or a tweedie.
wq_gam1 <- gam(nox ~ s(dt), data = wq1, family = gaussian, method = "REML")
# pseudo log-normal
wq_gam2 <- gam(nox ~ s(dt), data = wq1, family = gaussian(link = "log"),
method = "REML")
wq_gam3 <- gam(nox ~ s(dt), data = wq1, family = Gamma(link = "log"),
method = "REML")
wq_gam4 <- gam(nox ~ s(dt), data = wq1, family = tw(link = "log"),
method = "REML")
AICc(wq_gam1, wq_gam2, wq_gam3, wq_gam4) %>% arrange(AICc)
Gamma is best (wq_gam3), Tweedie is fairly similar, but on the basis of complexity, gamma is better.
k.check(wq_gam3) # not over-constrained
## k' edf k-index p-value
## s(dt) 9 3.07189 1.463834 1
appraise(wq_gam3) # looks ok overall.
concurvity(wq_gam3) # concurvity ok! Close to zero.
## para s(dt)
## worst 3.097166e-23 3.096820e-23
## observed 3.097166e-23 1.023266e-30
## estimate 3.097166e-23 1.116082e-25
The fact that the observed vs. predicted is not so tight isn’t necessarily surprising, considering we only have one predictor at present!
simulateResiduals(wq_gam3, plot=T)
## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help.
##
## Scaled residual values: 0 0.992 0.076 0.752 0.632 0.56 0.224 0.288 0.288 0.28 0.88 0.184 0.744 0.512 0.624 0.784 0.412 0.552 0.72 0.856 ...
There’s a slight worry about non-linearity in the residuals, not that the bottom line is red, but rather, that they all trend towards going upwards. Not a show stopper, but definitely would like to improve on this prior to selecting a final model.
draw(wq_gam3, residuals = F)
What if there were different temporal trends for wet vs. dry seasons? This how you would model this…
wq_gam5 <- gam(nox ~ s(dt, by = season), data = wq1, family = Gamma(link = "log"),
method = "REML")
simulateResiduals(wq_gam5, plot = T)
## Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help.
##
## Scaled residual values: 0 0.964 0.204 0.896 0.512 0.632 0.308 0.384 0.26 0.268 0.988 0.112 0.712 0.748 0.56 0.676 0.684 0.492 0.644 0.988 ...
k.check(wq_gam5)
## k' edf k-index p-value
## s(dt):seasonDry 9 2.883745 1.230821 0.910
## s(dt):seasonWet 9 1.191068 1.230821 0.905
draw(wq_gam5)
What about monthly cycling, or annual cycling? We can look at this using a more fine distinction of month and day of the year effects, as a cyclical effect.
Need to tell it how many months the basis functions it is.
wq_gam6 <- gam(nox ~ s(dt) +
s(month, bs = 'cc', k = 5, fx = FALSE),
knots = list(month = seq(1,12, length = 5)),
data = wq1, family = Gamma(link = "log"),
method = "REML")
Fit a random effects model per reef:
wq_gam7 <- gam(nox ~ s(dt, by = region) +
s(month, bs = 'cc', by = region, k = 5) +
s(reef_alias, bs = "re"),
knots = list(month = seq(1,12, length = 5)),
family = Gamma(link = "log"),
data = wq, method = "REML")
k.check(wq_gam7)
## k' edf k-index p-value
## s(dt):regionBurdekin 9 6.298137e+00 0.7493435 0.0000
## s(dt):regionMackay Whitsunday 9 3.970266e+00 0.7493435 0.0000
## s(dt):regionWet Tropics 9 7.138526e+00 0.7493435 0.0000
## s(month):regionBurdekin 3 1.517644e+00 0.8689999 0.1200
## s(month):regionMackay Whitsunday 3 2.323597e+00 0.8689999 0.1300
## s(month):regionWet Tropics 3 7.884133e-04 0.8689999 0.1225
## s(reef_alias) 28 2.241402e+01 NA NA
Possibly more complexity in the date effect.
wq_gam8 <- gam(nox ~ s(dt, by = region, k = 20) +
s(month, bs = 'cc', by = region, k = 5) +
s(reef_alias, bs = "re"),
knots = list(month = seq(1,12, length = 5)),
family = Gamma(link = "log"),
data = wq, method = "REML")
k.check(wq_gam8)
## k' edf k-index p-value
## s(dt):regionBurdekin 19 6.887115e+00 0.7588022 0.0000
## s(dt):regionMackay Whitsunday 19 4.235643e+00 0.7588022 0.0000
## s(dt):regionWet Tropics 19 8.888716e+00 0.7588022 0.0000
## s(month):regionBurdekin 3 1.541980e+00 0.8727190 0.1300
## s(month):regionMackay Whitsunday 3 2.322221e+00 0.8727190 0.1150
## s(month):regionWet Tropics 3 6.593186e-04 0.8727190 0.1225
## s(reef_alias) 28 2.241898e+01 NA NA
summary(wq_gam8)
##
## Family: Gamma
## Link function: log
##
## Formula:
## nox ~ s(dt, by = region, k = 20) + s(month, bs = "cc", by = region,
## k = 5) + s(reef_alias, bs = "re")
##
## Parametric coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.1199 0.1464 0.819 0.413
##
## Approximate significance of smooth terms:
## edf Ref.df F p-value
## s(dt):regionBurdekin 6.887e+00 8.496 11.925 < 2e-16 ***
## s(dt):regionMackay Whitsunday 4.236e+00 5.263 3.620 0.00272 **
## s(dt):regionWet Tropics 8.889e+00 10.930 18.653 < 2e-16 ***
## s(month):regionBurdekin 1.542e+00 3.000 1.099 0.11036
## s(month):regionMackay Whitsunday 2.322e+00 3.000 9.877 6.07e-07 ***
## s(month):regionWet Tropics 6.593e-04 3.000 0.000 0.70889
## s(reef_alias) 2.242e+01 27.000 5.265 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.243 Deviance explained = 47.6%
## -REML = 684.18 Scale est. = 1.1562 n = 568
Everything by date is wiggly, the Burdekin and Wet Tropics are not wiggly, but the Mackay/Whitsundays are wiggly!
draw(wq_gam8)
emmeans(wq_gam8, pairwise ~ dt, at = list(dt = c(2014,2016)), type = "response")
## $emmeans
## dt response SE df lower.CL upper.CL
## 2014 2.355 0.406 521 1.678 3.31
## 2016 0.933 0.115 521 0.732 1.19
##
## Results are averaged over the levels of: reef_alias, region
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## contrast ratio SE df null t.ratio p.value
## 2014 / 2016 2.52 0.475 521 1 4.921 <.0001
##
## Results are averaged over the levels of: reef_alias, region
## Tests are performed on the log scale
NOx is 2.52 times higher in 2014 vs. 2016!
emmeans(wq_gam8, pairwise ~ dt|region, at = list(dt = c(2014,2016)), type = "response") %>% confint()
## $emmeans
## region = Burdekin:
## dt response SE df lower.CL upper.CL
## 2014 1.923 0.634 521 1.006 3.67
## 2016 0.840 0.195 521 0.533 1.32
##
## region = Mackay Whitsunday:
## dt response SE df lower.CL upper.CL
## 2014 2.421 0.825 521 1.239 4.73
## 2016 1.080 0.276 521 0.653 1.79
##
## region = Wet Tropics:
## dt response SE df lower.CL upper.CL
## 2014 2.805 0.538 521 1.925 4.09
## 2016 0.895 0.118 521 0.690 1.16
##
## Results are averaged over the levels of: reef_alias
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
##
## $contrasts
## region = Burdekin:
## contrast ratio SE df lower.CL upper.CL
## 2014 / 2016 2.29 0.863 521 1.09 4.80
##
## region = Mackay Whitsunday:
## contrast ratio SE df lower.CL upper.CL
## 2014 / 2016 2.24 0.763 521 1.15 4.37
##
## region = Wet Tropics:
## contrast ratio SE df lower.CL upper.CL
## 2014 / 2016 3.13 0.755 521 1.95 5.03
##
## Results are averaged over the levels of: reef_alias
## Confidence level used: 0.95
## Intervals are back-transformed from the log scale
Burdekin: 2.29x Mackay-Whitsunday: 2.24x Wet Tropics: 3.13x
fx means that it does not force the model to have k=5, but just sets k to a max of 5. Defaults to fx = FALSE.